NSF PAR Search | NSF Public Access Repository

Increasing efficiency of SVMp+ for handling missing values in healthcare prediction

https://doi.org/10.1371/journal.pdig.0000281

Zhang, Yufeng; Gao, Zijun; Wittrup, Emily; Gryak, Jonathan; Najarian, Kayvan (June 2023, PLOS Digital Health)
Simsekler, Mecit Can (Ed.)
Missing data presents a challenge for machine learning applications, especially when electronic health records are used to develop clinical decision support systems. These values are missing in part because clinical data are complex and their content is personalized to each patient. Several methods have been developed to handle this issue, such as imputation and complete case analysis, but their limitations restrict the reliability of findings. However, recent studies have explored how using some features as fully available privileged information can increase model performance, including in SVMs. Building on this insight, we propose a computationally efficient kernel SVM-based framework (l2-SVMp+) that leverages partially available privileged information to guide model construction. Our experiments validated the superiority of l2-SVMp+ over common approaches for handling missingness and over previous implementations of SVMp+ in digit recognition, disease classification, and patient readmission prediction tasks. The performance improves as the percentage of available privileged information increases. Our results showcase the capability of l2-SVMp+ to handle incomplete but important features in real-world medical applications, surpassing traditional SVMs that lack privileged information. Additionally, l2-SVMp+ achieves model performance comparable or superior to that obtained with imputed privileged features.
Full Text Available
LinCDE: Conditional Density Estimation via Lindsey’s Method

Gao, Zijun; Hastie, Trevor (January 2022, Journal of machine learning research)

Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey’s method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characteristics like modality and shape. In particular, when suitably parametrized, LinCDE will produce smooth and non-negative density estimates. Furthermore, like boosted regression trees, LinCDE does automatic feature selection. We demonstrate LinCDE’s efficacy through extensive simulations and three real data examples.
Full Text Available
Assessment of heterogeneous treatment effect estimation accuracy via matching

https://doi.org/10.1002/sim.9010

Gao, Zijun; Hastie, Trevor; Tibshirani, Robert (April 2021, Statistics in Medicine)

We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo‐observations of the HTE based on matching. Our contributions are three‐fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum‐cost flow problem and provide an efficient algorithm; third, we propose a match‐then‐split principle for the assessment with cross‐validation. We demonstrate the efficacy of the assessment approach using simulations and a real dataset.